CONCEPT STROKE: Analytical pipeline

Author

Data Science for Health Services and Policy Research (IACS)

Introduction

Definition

CONCEPT STROKE is a study analysing the acute care received by patients with acute ischaemic stroke. Its aim is to show the relevance of care pathways (traces/trajectories) to the outcomes and efficiency of stroke care.

  • Participating regions: Aragón, País Vasco, Cataluña, Navarra and Valencia.

The study follows a two-stage design:

1- Cross-sectional data mining design

2- Quasi-experimental design comparing interventions in acute ischaemic stroke.

The main endpoints are:

1- In the first stage, the pathway of care as it occurs in real life and the propensity of a patient to follow a specific pathway (trace).

2- In the second stage, the survival of patients 30 days and 6 months after admission to an emergency room.

Cohort

The cohort is defined as patients admitted to hospital due to acute ischaemic stroke.

  • Inclusion criteria: Patients aged 18 years or older admitted to the emergency department (or with an unplanned hospital admission) with a principal diagnosis of acute ischaemic stroke during the study period.

  • Exclusion criteria: Patients aged 17 years or younger; Patients with a diagnosis of acute haemorrhagic stroke or with other non-specific stroke diagnoses.

  • Study period: from 01-01-2010 to the most recent available data.

Analysis plan

A- Process mining to discover actual care pathways and compare them with the theoretical pathway and across the participating regions.

B- Survival analysis to provide prediction of health outcomes within each pathway.

C- Hierarchical generalised additive modelling to compare the effectiveness of interventions within pathways.

D- Economic evaluation to measure economic impact and efficiency.

Running code

Descriptive analysis

To study the observed data, we performed a small exploratory analysis. First, we must convert our DataFrame to an Event Log object. One drawback when creating the Event Log is the differing granularity of the dates: hospital dates are usually accurate to the day, while emergency dates are usually accurate to the second. We therefore wrote a function to check that the dates are correct and to flag any that do not make medical sense. Typical errors include emergency and hospital dates falling on the same day, where the differing granularity causes the hospital date to be ordered first, or an emergency discharge date that precedes the admission date, among others.
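The checks described above can be sketched in pandas; the column names (`emergency_adm`, `emergency_dis`, `hospital_adm`) are illustrative assumptions, not the study's actual schema.

```python
# Sketch of the date-consistency checks: flag rows whose timestamps do not
# make clinical sense. Column names here are assumptions for illustration.
import pandas as pd

def check_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose timestamps are inconsistent."""
    # Hospital dates are day-level while emergency dates are second-level,
    # so hospital timestamps are compared at day granularity.
    hosp = df["hospital_adm"].dt.normalize()
    # Emergency discharge must not precede emergency admission.
    bad_emergency = df["emergency_dis"] < df["emergency_adm"]
    # Hospital admission must fall between emergency admission and discharge.
    bad_hospital = (hosp < df["emergency_adm"].dt.normalize()) | (
        hosp > df["emergency_dis"].dt.normalize()
    )
    return df[bad_emergency | bad_hospital]

df = pd.DataFrame(
    {
        "case_id": [1, 2],
        "emergency_adm": pd.to_datetime(["2020-01-01 10:00:00", "2020-01-02 09:00:00"]),
        "emergency_dis": pd.to_datetime(["2020-01-01 12:30:00", "2020-01-01 08:00:00"]),
        "hospital_adm": pd.to_datetime(["2020-01-01", "2020-01-03"]),
    }
)
errors = check_dates(df)
print(len(errors))  # 1 — case 2 is discharged before admission
```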

As part of the exploratory analysis it is interesting to know how many different pathways appear and the frequency of each one. This tells us what percentage of cases fall into the most frequent pathways and which pathways are isolated cases.
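Counting distinct traces amounts to grouping events per case and tallying the resulting activity sequences; a minimal sketch with invented activity names:

```python
# Count distinct traces (pathways) and their frequency. The event table and
# activity names below are toy examples, not the study's real data.
from collections import Counter
import pandas as pd

events = pd.DataFrame(
    {
        "case_id": [1, 1, 2, 2, 3, 3],
        "activity": ["ER", "CT", "ER", "CT", "ER", "fibrinolysis"],
    }
)

# One tuple of activities per case, assuming events are already time-ordered.
traces = events.groupby("case_id")["activity"].agg(tuple)
freq = Counter(traces)

# Share of cases covered by the single most frequent pathway.
share = freq.most_common(1)[0][1] / len(traces)
print(freq[("ER", "CT")])  # 2 cases share this pathway
print(round(share, 2))     # 0.67
```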

Survival analysis

We carried out a survival analysis of the 10 most frequent traces, constructing a Kaplan-Meier curve for each of them for comparison. We then fitted a Cox model to compare these pathways and obtain the hazard ratio (HR) of each pathway relative to the rest.
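As a dependency-free illustration of the Kaplan-Meier estimate used for each trace, here is the product-limit estimator computed by hand on toy data (in practice a library such as lifelines provides this):

```python
# Kaplan-Meier product-limit estimator: S(t) = prod over event times t_i <= t
# of (1 - d_i / n_i), with d_i events and n_i at risk at t_i. Toy data only.
import numpy as np

def kaplan_meier(times, events):
    """Return [(event_time, survival_probability), ...]."""
    times, events = np.asarray(times), np.asarray(events)
    surv, s = [], 1.0
    for t in np.unique(times[events == 1]):
        n_at_risk = np.sum(times >= t)                 # still under observation
        d = np.sum((times == t) & (events == 1))       # events at this time
        s *= 1 - d / n_at_risk
        surv.append((t, s))
    return surv

times = [5, 8, 8, 12, 20, 30]   # follow-up in days
events = [1, 1, 0, 1, 0, 1]     # 1 = death (exitus), 0 = censored
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))       # survival drops to 0.833, 0.667, 0.444, 0.0
```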

A- Process Mining

For process mining we created several functions depending on the part of the process mining study, which can be divided into:

  • Process discovery

  • Conformance checking

  • Decision mining

  • Prediction

Process discovery

Process discovery attempts to find a suitable process model that describes the order of events/activities executed during a process.

The next step after the descriptive analysis was to build a Petri net to discover the process. Different algorithms exist for this: alpha mining, inductive mining and heuristic mining.

Another type of graph that can be built is the Directly-Follows Graph (DFG). Although it can form part of process discovery, it also serves as a descriptive analysis, since it shows all the pathways present in the data.

To reduce the high dimensionality, one option is to filter the traces to keep only the k most frequent traces. In this case we filtered by the k=10 most frequent traces.
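Both steps can be sketched in plain Python (in practice a process-mining library such as pm4py offers equivalent functions; the activity names below are invented):

```python
# Minimal directly-follows graph (DFG) and top-k variant filter.
from collections import Counter

traces = [
    ("ER", "CT", "fibrinolysis"),
    ("ER", "CT", "fibrinolysis"),
    ("ER", "CT"),
    ("ER", "MRI", "CT"),
]

# DFG: count how often activity a is directly followed by activity b.
dfg = Counter((a, b) for t in traces for a, b in zip(t, t[1:]))

def filter_top_k(traces, k):
    """Keep only cases whose trace is among the k most frequent variants."""
    top = {variant for variant, _ in Counter(traces).most_common(k)}
    return [t for t in traces if t in top]

print(dfg[("ER", "CT")])             # 3
print(len(filter_top_k(traces, 1)))  # 2 cases follow the single top variant
```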

Conformance checking

Conformance checking is a technique for comparing a process model with an event log of the same process. As a first approximation, we compared the most frequent pathways with the one we established as theoretical, using Jaccard similarity without taking order into account; that is, we calculated the quotient of:

  • Numerator: the number of activities shared between each of these pathways and the theoretical one.

  • Denominator: the number of activities in the union of each of these pathways with the theoretical one.
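The order-free Jaccard similarity described above reduces to set operations (the pathway below is an invented example):

```python
# Order-free Jaccard similarity between a pathway and the theoretical pathway:
# |A ∩ B| / |A ∪ B| over the sets of activities, ignoring order and repeats.
def jaccard(trace_a, trace_b):
    a, b = set(trace_a), set(trace_b)
    return len(a & b) / len(a | b)

theoretical = ["ER", "CT", "fibrinolysis", "stroke_unit"]  # illustrative
observed = ["ER", "CT", "stroke_unit"]
print(jaccard(observed, theoretical))  # 0.75 — shares 3 of 4 activities
```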

Decision mining

Decision mining lets us identify the main patient characteristics that lead a patient to follow a certain path. To do this, patient characteristics are added to the Event Log, a Petri net is created with the inductive algorithm, and the decision points of the net are examined. At each decision point we measure the importance of the characteristics, much like fitting a decision tree at that point. This step can help identify which variables matter as inputs to the prediction model in the next section.
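The decision-tree analogy can be made concrete with scikit-learn; the features, values and outcome labels below are invented for illustration, not the study's actual variables:

```python
# Decision mining at one decision point, sketched as a decision tree: patient
# features predict which outgoing branch of the decision point is taken.
from sklearn.tree import DecisionTreeClassifier

# Toy data: [age, prior modified Rankin scale] -> branch taken (labels invented).
X = [[80, 3], [75, 4], [60, 0], [55, 1], [82, 5], [50, 0]]
y = ["conservative", "conservative", "fibrinolysis",
     "fibrinolysis", "conservative", "fibrinolysis"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[58, 1]])[0])   # a young, low-mRS patient
print(tree.feature_importances_)    # relative importance of the two features
```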

Prediction

To predict which pathway a given patient should follow based on their characteristics, we used the bupaR library, which makes use of a transformer model to predict the pathway as a sequence of activities. Thus, an Event Log with features and a transformer model are used to predict the next activity.
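The transformer predictor itself lives in the bupaR ecosystem (R); purely to illustrate the next-activity prediction task, here is a toy frequency baseline in Python (activity names invented):

```python
# Baseline next-activity predictor: for each activity, predict the activity
# that most often follows it in the observed traces. Illustration only; the
# study uses a transformer model, not this frequency rule.
from collections import Counter, defaultdict

traces = [
    ("ER", "CT", "fibrinolysis"),
    ("ER", "CT", "fibrinolysis"),
    ("ER", "CT", "stroke_unit"),
]

nxt = defaultdict(Counter)
for t in traces:
    for a, b in zip(t, t[1:]):
        nxt[a][b] += 1

def predict_next(activity):
    return nxt[activity].most_common(1)[0][0]

print(predict_next("CT"))  # fibrinolysis — follows CT in 2 of 3 cases
```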

B- Estimation of outcomes within a path

We fit a proportional-hazards regression with the propensity scores for specific paths as the main independent factor in predicting survival at 6 and 12 months after admission.

B.1 Kaplan-Meier survival plot

The Kaplan-Meier survival plot for the 4 intervention possibilities is shown below:

  • None

  • Fibrinolysis

  • Mechanical thrombectomy

  • Combined (fibrinolysis + mechanical thrombectomy)

B.2 General Cox model

A Cox model is built with the survival object (time = survival time; event = exitus) as the dependent variable and the following independent variables:

  • Categorical: intervention, sex, zip code, hospital, hospital type discharge, weekday, modified Rankin scale, rank trace, holidays, weekend, prescriptions and comorbidities.

  • Numerical: age, Jaccard similarity measure, trace duration, period, number of admissions prior to emergency, number of admissions prior to hospitalisation.

B.3 Propensity to intervention model

To address the violation of the proportional-hazards assumption in the Cox model, 4 different propensity-to-intervention models are estimated, using as covariates those variables found to be significant (p-value < 0.05) in the general Cox model constructed in the previous section.

  1. Propensity to fibrinolysis intervention model

  2. Propensity to mechanical thrombectomy intervention model

  3. Propensity to combined intervention model

  4. Propensity to any intervention model

After building the models, the propensity to intervention for each patient is predicted with each model, and the propensity score (PS) is calculated as PS_i = 1 / (1 − p_i), where p_i is the predicted probability for patient i.
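The PS formula above is a one-liner; a quick sketch (probabilities invented):

```python
# Propensity score per the formula above: PS_i = 1 / (1 - p_i), where p_i is
# the model's predicted probability of intervention for patient i.
def propensity_score(p):
    return 1.0 / (1.0 - p)

print(propensity_score(0.5))            # 2.0
print(round(propensity_score(0.8), 2))  # 5.0 — higher probability, higher PS
```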

B.4 Model to predict exitus with PS as covariate

After calculating the PS for each intervention, a model is constructed per intervention to predict exitus, with the PS of the corresponding intervention as a covariate; finally, a general model is fitted with the PS of each intervention as covariates and the PS of any intervention as an offset.
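The role of the offset can be made concrete: in a logistic model, an offset enters the linear predictor with its coefficient fixed at 1, so eta = offset + X·β and p = 1 / (1 + exp(−eta)). A hand-rolled sketch (all numbers illustrative):

```python
# Predicted probability from a logistic model with an offset term. Here the
# offset plays the role of the any-intervention PS; coefficients are invented.
import math

def predict_prob(x, beta, offset):
    eta = offset + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

p = predict_prob(x=[1.2, 0.4], beta=[0.5, -0.3], offset=1.1)
print(round(p, 3))  # 0.829
```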

Results

These results were obtained with synthetic data previously generated according to the data model.

Descriptive analysis

First, we imported the data, converted our DataFrame to an Event Log and checked the dates.

There are 46 errors in emergency dates, if there are errors they may be dates outside the limits or there may be dates without having entered the emergency
There are 0 errors in hospital dates, if there are errors they may be dates outside the limits or there are dates without having entered the hospital
There are 0 errors in hospital dates with emergency dates, the hospital admission date is between the emergency admission and discharge date (included).

As a descriptive measure of the data, we show a bar plot with the number of distinct traces and their frequency.

Figure 1: Bar plot with the number of distinct traces and their frequency

Survival analysis

The Kaplan-Meier curves for the 10 most common pathways and the Cox model summary are shown.

<lifelines.CoxPHFitter: fitted with 133 total observations, 61 right-censored observations>
             duration col = 'survival_in_days'
                event col = 'status'
                penalizer = 0.1
                 l1 ratio = 0.0
      baseline estimation = breslow
   number of observations = 133
number of events observed = 72
   partial log-likelihood = -324.04
         time fit was run = 2023-11-16 11:32:47 UTC

---
            coef  exp(coef)   se(coef)   coef lower 95%   coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
covariate                                                                                                         
trace_1    -0.08       0.93       0.40            -0.86             0.70                 0.42                 2.02
trace_2     0.13       1.14       0.43            -0.70             0.97                 0.49                 2.64
trace_3     0.06       1.06       0.43            -0.77             0.90                 0.46                 2.45
trace_4     0.40       1.49       0.42            -0.43             1.23                 0.65                 3.41
trace_5     0.09       1.10       0.44            -0.77             0.96                 0.46                 2.61
trace_6    -0.49       0.61       0.46            -1.39             0.42                 0.25                 1.52
trace_7     0.10       1.10       0.44            -0.77             0.97                 0.46                 2.63
trace_8    -0.14       0.87       0.46            -1.03             0.75                 0.36                 2.13
trace_9    -0.14       0.87       0.48            -1.08             0.79                 0.34                 2.20
trace_10    0.07       1.07       0.44            -0.80             0.94                 0.45                 2.56

            cmp to     z    p   -log2(p)
covariate                               
trace_1       0.00 -0.19 0.85       0.24
trace_2       0.00  0.31 0.75       0.41
trace_3       0.00  0.15 0.88       0.18
trace_4       0.00  0.94 0.35       1.53
trace_5       0.00  0.21 0.83       0.27
trace_6       0.00 -1.05 0.29       1.77
trace_7       0.00  0.22 0.83       0.27
trace_8       0.00 -0.31 0.76       0.40
trace_9       0.00 -0.30 0.76       0.39
trace_10      0.00  0.15 0.88       0.19
---
Concordance = 0.58
Partial AIC = 668.07
log-likelihood ratio test = 3.92 on 10 df
-log2(p) of ll-ratio test = 0.07

Figure 2: Survival descriptive

A- Process mining

Process discovery

Inductive miner traces:

Figure 3: Inductive miner with only filtered traces

Conformance checking

Comparison of the 10 most frequent traces with the theoretical one, using Jaccard similarity without taking order into account:

The theoretical trace is:

The Jaccard similarity measured for the k=10 most frequent traces:

Also shown below is the histogram of the Jaccard similarity of the patient traces compared to the theoretical one.

Figure 4: Histogram of the Jaccard similarity of patient traces compared with the theoretical one

Decision mining

First, we created a Petri net with the inductive algorithm, which allows us to identify the different decision points. The two images below show the same Petri net: the first displays the decision points and the second the activities (transitions). In this section, in order to create a Petri net with decision points, it was necessary to delete some records from the event log.

Figure 5: Petri net with points for decision mining

Figure 6: Petri net with activities for decision mining

Once the Petri net was created, we were able to see the importance of the features at the decision point(s).

In this case, the points are:

The decision point for fibrinolysis in hospital is: p_7
The decision point for thrombectomy in hospital is: p_7
The decision point for thrombolysis in emergency is: p_19

If the importance of the variables does not appear for a given point, it is because the complete decision-mining process could not be carried out for that point, due to lack of information or the complexity of the Petri net.

Prediction

The method used to make predictions is based on predicting the next activity, using the bupaR tool, which makes use of a transformer model. Starting with predicting the next activity, the model's evaluation is:

                       loss sparse_categorical_accuracy 
                       0.63                        0.77 

A plot is also shown that allows a clear view of the accuracy of the model:

Figure 7: Confusion matrix for predictions

B- Estimation of outcomes within a path

B.1 Kaplan-Meier survival plot

The Kaplan-Meier survival plot for the 4 intervention possibilities is shown below:

B.2 General Cox model

A summary of the constructed Cox model is shown below: a table with all variables and, where possible, a ggforest plot with only the statistically significant variables (p-value < 0.05).

  surv_obj

Predictors                         Estimates  CI            p
intervention [fibrinolysis]        1.37       0.98 – 1.90   0.065
intervention [thrombectomy_mec]    1.01       0.65 – 1.55   0.973
intervention [combined]            1.07       0.78 – 1.47   0.674
hospital cd [220020]               1.50       0.80 – 2.81   0.202
hospital cd [220036]               0.63       0.31 – 1.30   0.211
hospital cd [220041]               1.31       0.70 – 2.48   0.399
hospital cd [220054]               0.97       0.53 – 1.77   0.908
hospital cd [220089]               1.15       0.54 – 2.42   0.721
hospital cd [220105]               1.37       0.75 – 2.50   0.303
hospital cd [440012]               1.28       0.70 – 2.35   0.427
hospital cd [440027]               0.64       0.33 – 1.25   0.189
hospital cd [440033]               0.73       0.34 – 1.56   0.416
hospital cd [440048]               0.77       0.41 – 1.44   0.412
hospital cd [500016]               0.68       0.33 – 1.41   0.306
hospital cd [500021]               1.18       0.63 – 2.20   0.608
hospital cd [500055]               1.12       0.61 – 2.08   0.710
hospital cd [500068]               1.26       0.66 – 2.41   0.488
hospital cd [500074]               1.01       0.55 – 1.87   0.966
hospital cd [500080]               0.79       0.39 – 1.63   0.528
hospital cd [500093]               1.40       0.72 – 2.73   0.324
hospital cd [500107]               1.06       0.55 – 2.02   0.867
hospital cd [500114]               1.30       0.69 – 2.43   0.417
hospital cd [500129]               1.04       0.54 – 2.02   0.906
hospital cd [500135]               0.87       0.44 – 1.72   0.692
hospital cd [500140]               0.82       0.40 – 1.69   0.594
hospital cd [500153]               1.22       0.66 – 2.27   0.531
hospital cd [500172]               0.79       0.39 – 1.63   0.529
hospital cd [500188]               0.68       0.34 – 1.36   0.277
hospital cd [500195]               1.06       0.56 – 1.98   0.866
hospital cd [500200]               0.96       0.51 – 1.80   0.901
hospital cd [500218]               0.67       0.32 – 1.41   0.294
hospital cd [500223]               1.38       0.74 – 2.57   0.316
hospital cd [999999]               0.75       0.39 – 1.42   0.375
age nm                             1.00       1.00 – 1.01   0.080
sex cd [1]                         0.96       0.74 – 1.24   0.737
sex cd [2]                         1.01       0.78 – 1.31   0.927
sex cd [9]                         1.12       0.87 – 1.44   0.377
zip code cd                        1.00       1.00 – 1.00   0.184
hospital type discharge cd [2]     1.00       0.71 – 1.40   0.980
hospital type discharge cd [3]     0.96       0.68 – 1.34   0.800
hospital type discharge cd [4]     1.09       0.78 – 1.54   0.609
hospital type discharge cd [5]     1.04       0.75 – 1.44   0.809
hospital type discharge cd [8]     0.82       0.57 – 1.16   0.262
hospital type discharge cd [9]     1.05       0.76 – 1.45   0.784
n admission prior emergency nm     1.00       1.00 – 1.00   0.944
n admission prior inhospital nm    1.00       1.00 – 1.00   0.417
modified rankin scale cd [1]       0.75       0.51 – 1.10   0.140
modified rankin scale cd [2]       0.94       0.65 – 1.37   0.757
modified rankin scale cd [3]       0.85       0.57 – 1.26   0.427
modified rankin scale cd [4]       0.95       0.65 – 1.38   0.778
modified rankin scale cd [5]       1.38       0.96 – 1.99   0.081
modified rankin scale cd [6]       0.85       0.58 – 1.26   0.422
modified rankin scale cd [7]       0.66       0.45 – 0.97   0.037
modified rankin scale cd [8]       1.03       0.72 – 1.48   0.859
heart failure bl                   0.98       0.82 – 1.17   0.829
hypertension bl                    0.99       0.83 – 1.19   0.955
diabetes bl                        1.16       0.97 – 1.39   0.103
atrial fibrillation bl             1.16       0.97 – 1.39   0.115
valvular disease bl                1.07       0.89 – 1.28   0.480
rank trace [10]                    0.93       0.32 – 2.75   0.899
rank trace [2]                     1.18       0.41 – 3.38   0.756
rank trace [3]                     1.37       0.51 – 3.64   0.530
rank trace [4]                     1.14       0.41 – 3.18   0.804
rank trace [5]                     1.16       0.39 – 3.44   0.785
rank trace [6]                     0.79       0.24 – 2.59   0.695
rank trace [7]                     0.74       0.25 – 2.19   0.587
rank trace [8]                     1.01       0.35 – 2.88   0.986
rank trace [9]                     0.61       0.19 – 1.99   0.413
rank trace [otros]                 1.13       0.55 – 2.31   0.739
jaccard similarity                 0.02       0.00 – 0.41   0.011
dur trace                          1.00       1.00 – 1.00   0.028
period                             1.00       1.00 – 1.00   0.974
holiday bl                         1.33       0.74 – 2.40   0.339
weekend bl                         0.97       0.70 – 1.34   0.833
weekday [Monday]                   1.20       0.86 – 1.68   0.278
weekday [Saturday]                 1.09       0.78 – 1.54   0.603
weekday [Thursday]                 1.10       0.80 – 1.52   0.569
weekday [Tuesday]                  0.97       0.70 – 1.37   0.880
weekday [Wednesday]                0.87       0.62 – 1.22   0.408
Observations                       1000
R2 Nagelkerke                      0.080

A test of whether the proportional-hazards assumption is satisfied can be seen in the overall summary; a p-value < 0.05 indicates that the assumption is not satisfied. The test may fail to compute when there are collinear variables in the Cox model or too few events.

                                   chisq df       p
intervention                      4.0719  3   0.254
hospital_cd                      21.7062 30   0.865
age_nm                            2.7951  1   0.095
sex_cd                            3.3488  3   0.341
zip_code_cd                       1.0276  1   0.311
hospital_type_discharge_cd       88.6355  6 < 2e-16
n_admission_prior_emergency_nm    0.0216  1   0.883
n_admission_prior_inhospital_nm   0.2497  1   0.617
modified_rankin_scale_cd         12.2530  8   0.140
heart_failure_bl                  0.1763  1   0.675
hypertension_bl                   0.0919  1   0.762
diabetes_bl                       0.4710  1   0.493
atrial_fibrillation_bl            0.0114  1   0.915
valvular_disease_bl               3.5051  1   0.061
rank_trace                        8.7919 10   0.552
jaccard_similarity                1.1866  1   0.276
dur_trace                        78.9327  1 < 2e-16
period                           31.4424  1 2.1e-08
holiday_bl                        0.2912  1   0.589
weekend_bl                        2.2131  1   0.137
weekday                           1.8529  5   0.869
GLOBAL                          251.6357 79 < 2e-16

B.3 Propensity to intervention model

The summary of the propensity to fibrinolysis intervention model is displayed:

  fibrinolysis intervention bl

Predictors                    Odds Ratios      CI                                  p
(Intercept)                   0.00             0.00 – 0.00                         <0.001
modified rankin scale cd [1]  1.38             0.75 – 2.56                         0.301
modified rankin scale cd [2]  1.75             0.96 – 3.22                         0.070
modified rankin scale cd [3]  1.28             0.69 – 2.40                         0.438
modified rankin scale cd [4]  1.19             0.64 – 2.25                         0.582
modified rankin scale cd [5]  0.92             0.49 – 1.73                         0.792
modified rankin scale cd [6]  1.09             0.58 – 2.03                         0.792
modified rankin scale cd [7]  1.24             0.67 – 2.28                         0.492
modified rankin scale cd [8]  1.15             0.63 – 2.09                         0.651
jaccard similarity            563048019785.21  8400675095.77 – 44961970582413.91   <0.001
dur trace                     1.00             1.00 – 1.00                         0.675
Observations                  1000
R2 Tjur                       0.197

The summary of the propensity to mechanical thrombectomy intervention model is displayed:

  thrombectomy mechanic intervention bl

Predictors                    Odds Ratios  CI                       p
(Intercept)                   1246152.23   50734.79 – 38266721.51   <0.001
modified rankin scale cd [1]  1.47         0.69 – 3.23              0.322
modified rankin scale cd [2]  0.68         0.27 – 1.66              0.397
modified rankin scale cd [3]  0.36         0.12 – 0.99              0.057
modified rankin scale cd [4]  0.64         0.24 – 1.60              0.348
modified rankin scale cd [5]  0.83         0.35 – 1.93              0.665
modified rankin scale cd [6]  0.94         0.39 – 2.26              0.896
modified rankin scale cd [7]  0.87         0.39 – 1.95              0.733
modified rankin scale cd [8]  0.86         0.39 – 1.92              0.707
jaccard similarity            0.00         0.00 – 0.00              <0.001
dur trace                     1.00         1.00 – 1.00              0.568
Observations                  1000
R2 Tjur                       0.107

The summary of the propensity to combined intervention model is shown:

  combined intervention bl

Predictors                    Odds Ratios  CI             p
(Intercept)                   48.27        8.98 – 264.74  <0.001
modified rankin scale cd [1]  0.97         0.56 – 1.68    0.905
modified rankin scale cd [2]  1.07         0.61 – 1.85    0.821
modified rankin scale cd [3]  1.22         0.69 – 2.14    0.494
modified rankin scale cd [4]  1.17         0.66 – 2.06    0.593
modified rankin scale cd [5]  1.31         0.75 – 2.27    0.341
modified rankin scale cd [6]  1.02         0.58 – 1.79    0.948
modified rankin scale cd [7]  1.11         0.65 – 1.91    0.698
modified rankin scale cd [8]  1.10         0.65 – 1.87    0.720
jaccard similarity            0.00         0.00 – 0.00    <0.001
dur trace                     1.00         1.00 – 1.00    0.610
Observations                  1000
R2 Tjur                       0.029

B.4 Model to predict exitus with PS as covariate

  exitus bl

Predictors       Odds Ratios  CI           p
(Intercept)      1.20         0.90 – 1.61  0.209
ps fibrinolysis  0.95         0.83 – 1.10  0.521
Observations     1000
R2 Tjur          0.000

  exitus bl

Predictors    Odds Ratios  CI           p
(Intercept)   0.39         0.17 – 0.93  0.034
ps combined   1.84         1.11 – 3.05  0.018
Observations  1000
R2 Tjur       0.006

  exitus bl

Predictors           Odds Ratios  CI           p
(Intercept)          0.51         0.25 – 1.03  0.061
ps thrombectomy mec  1.94         1.07 – 3.54  0.029
Observations         1000
R2 Tjur              0.005

Finally, a model is built to predict exitus, with the PS calculated in each of the previous models as covariates and the PS of any intervention as an offset:

  exitus bl

Predictors           Odds Ratios  CI              p
(Intercept)          0.00         0.00 – 0.00     <0.001
ps fibrinolysis      0.24         0.16 – 0.35     <0.001
ps thrombectomy mec  0.02         0.00 – 0.11     <0.001
ps combined          48.69        10.36 – 232.38  <0.001
Observations         1000
R2 Tjur              0.024